Using Unlabeled EMA Data in a Speech Production Model with a Rich Memory
Abstract
We present a pilot study which integrates articulatory information into the Context Sequence Model (CSM) of speech production [1]. The CSM is an exemplar-theoretic model which builds on the concept of the speech perception-production loop and incorporates a rich acoustic memory of past speech items, stored sequentially in their original context. In the present study, we enrich the original acoustic memory of the CSM with articulatory information in the form of continuous Electromagnetic Midsagittal Articulography (EMA) measurements. To our knowledge, no existing speech production model uses the full continuous EMA signals directly and in the same way as acoustic speech signals.

In a first series of experiments, we used data from a Polish corpus [2] designed to investigate the coordination of articulatory gestures within syllables in onset and coda positions, in particular the so-called C-Center effect, i.e. the temporal distance of the consonants in a cluster with respect to the vowel [3]. The corpus is composed of a set of repeated target words with simple onsets and codas containing single sonorants, as well as onset and coda clusters containing a voiceless stop and a sonorant, embedded in carrier phrases which guarantee identical contexts of tongue movements for all target consonants and clusters. The carrier phrases are structured as follows (target words “pranie” and “Cypr”, respectively): onset: “Ona mówi pranie aktualnie” (“She is saying laundry currently”); coda: “Ona powiedziała Cypr aktualnie” (“She said Cyprus currently”). The database contains speech recordings and EMA measurements from one male and two female native speakers, recorded with a 2D Electromagnetic Articulograph (Carstens AG100). The EMA data was sampled at 400 Hz, postprocessed, and manually annotated with phone segments and articulatory landmarks using the EMU Speech Database System [http://emu.sourceforge.net]. The target words were recorded in an emphatic and a non-emphatic articulation mode. We selected the phonetically labeled (C)CV and VC(C) sequences from approximately 670 target words for our production simulation.

Our implementation of the simulation reproduces the original CSM (see [1] for a detailed description) with two important modifications. First, we run the simulations under three different conditions: (i) using only the acoustic speech recordings, as in the original CSM, (ii) using the continuous EMA signals instead of the acoustic data, and (iii) using a multidimensional combined representation of both acoustic and EMA data. Second, we consider only the left context for the context matching procedure. When selecting an item for production from a set (or “cloud”) of candidate exemplars, the CSM compares the candidates’ original contexts with the context of the currently produced utterance, using a left and a right context. The left context stretches into the past and contains the acoustic signal (in our case also the EMA signals) of what was produced preceding the current target segment, whereas the right context estimates what is going to be produced in the future and contains the linguistic information (the phone labels) of what should be produced next (or what was originally produced after the respective candidate exemplars). Our results indicate that it might be possible to incorporate articulatory information into speech perception-production models using raw EMA data, without having to manually label specific articulatory landmarks.
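To make the left-context matching step concrete, the sketch below shows one way the exemplar selection could be implemented for the three simulation conditions. It is a minimal illustration, not the authors' implementation: the correlation-based similarity measure, the dictionary layout of the exemplars, and the frame-synchronous concatenation of acoustic and EMA features are all assumptions made for the example.

import numpy as np

def context_similarity(left_a, left_b):
    # Compare two left-context feature matrices (frames x dimensions) by the
    # mean per-dimension Pearson correlation over their most recent frames.
    # The actual CSM similarity function may differ; this is illustrative.
    n = min(len(left_a), len(left_b))
    a, b = left_a[-n:], left_b[-n:]
    sims = []
    for d in range(a.shape[1]):
        x, y = a[:, d], b[:, d]
        if x.std() == 0 or y.std() == 0:
            continue  # skip constant dimensions to avoid undefined correlations
        sims.append(float(np.corrcoef(x, y)[0, 1]))
    return float(np.mean(sims)) if sims else 0.0

def select_exemplar(candidates, current_left, condition="combined"):
    # candidates: list of dicts, each with a "left" entry holding the stored
    # left context as {"acoustic": frames x d_a, "ema": frames x d_e} arrays,
    # time-aligned at a common frame rate; current_left has the same layout.
    # condition selects which stream(s) to match: "acoustic", "ema", "combined".
    def view(ctx):
        if condition == "acoustic":
            return ctx["acoustic"]
        if condition == "ema":
            return ctx["ema"]
        return np.hstack([ctx["acoustic"], ctx["ema"]])  # condition (iii)
    cur = view(current_left)
    return max(candidates, key=lambda c: context_similarity(view(c["left"]), cur))

Running the same selection routine under the three condition settings is what allows acoustic-only, EMA-only, and combined simulations to be compared directly.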
This also suggests that unlabeled EMA traces can be used in acquisition models without having to justify an a priori defined set of discrete gestural features or landmarks. We are currently repeating the initial simulations on a database of English speech and EMA recordings (the MOCHA-TIMIT corpus).
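As a further illustration, the following sketch shows one way raw, unlabeled EMA traces could be time-aligned with acoustic feature frames to build the combined representation of condition (iii). Only the 400 Hz EMA sampling rate is taken from the corpus description; the 100 Hz acoustic frame rate, the within-frame averaging, and the per-dimension z-scoring are assumptions made for the example.

import numpy as np

def ema_to_frames(ema, ema_rate=400, frame_rate=100):
    # Downsample continuous EMA channels (samples x channels) to the acoustic
    # frame rate by averaging the samples that fall within each frame.
    step = int(round(ema_rate / frame_rate))
    n_frames = ema.shape[0] // step
    return ema[:n_frames * step].reshape(n_frames, step, -1).mean(axis=1)

def combined_representation(acoustic_frames, ema, ema_rate=400, frame_rate=100):
    # Concatenate per-frame acoustic features with the downsampled EMA channels
    # and z-score each dimension so the two streams contribute on a comparable
    # scale when contexts are matched.
    ema_frames = ema_to_frames(ema, ema_rate, frame_rate)
    n = min(len(acoustic_frames), len(ema_frames))
    joint = np.hstack([acoustic_frames[:n], ema_frames[:n]])
    return (joint - joint.mean(axis=0)) / (joint.std(axis=0) + 1e-8)

The same alignment would apply to other corpora such as MOCHA-TIMIT, provided the EMA sampling rate and the acoustic frame rate are known.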
Similar references
Relationship between Working Memory, Auditory Perception and Speech Intelligibility in Cochlear Implanted Children of Elementary School
Objectives: This study examined the relationship between working and short-term memory performance, and their effects on cochlear implant outcomes (speech perception and speech production) in cochlear implanted children aged 7-13 years. The study also compared the memory performance of cochlear implanted children with their normal hearing peers. Methods: Thirty-one cochlear impl...
Session 2aSC: Linking Perception and Production (Poster Session) 2aSC47. Acoustic and articulatory information as joint factors coexisting in the context sequence model of speech production
This simulation study presents the integration of an articulatory factor into the Context Sequence Model (CSM) (Wade et al., 2010) of speech production using Polish sonorant data measured with the Electromagnetic Articulograph technology (EMA) (Mücke et al., 2010). Based on exemplar-theoretic assumptions (Pierrehumbert 2001), the CSM models the speech production-perception loop operating on a s...
Speech Events are Recoverable from Unlabeled Articulatory Data: Using an Unsupervised Clustering Approach on Data Obtained from Electromagnetic Midsaggital Articulography (EMA)
Some models of speech perception/production and language acquisition make use of a quasi-continuous representation of the acoustic speech signal. We investigate whether such models could potentially profit from incorporating articulatory information in an analogous fashion. In particular, we investigate how articulatory information represented by EMA measurements can influence unsupervised phon...
Parallelization of Rich Models for Steganalysis of Digital Images using a CUDA-based Approach
There are several different methods to make an efficient strategy for steganalysis of digital images. A very powerful method in this area is rich model consisting of a large number of diverse sub-models in both spatial and transform domain that should be utilized. However, the extraction of a various types of features from an image is so time consuming in some steps, especially for training pha...
Reconstruction of mistracked articulatory trajectories
Kinematic articulatory data are important for researches of speech production, articulatory speech synthesis, robust speech recognition, and speech inversion. Electromagnetic Articulograph (EMA) is a widely used instrument for collecting kinematic articulatory data. However, in EMA experiment, one or more coils attached to articulators are possible to be mistracked due to various reasons. To ma...